feat(benchmark): gateway benchmark harness (footprint, scaling, config, heap) #64

Merged: bburda merged 7 commits into main from feat/benchmark-harness on Jun 21, 2026
@bburda bburda commented Jun 18, 2026

What this gives

A benchmark for the gateway's runtime cost that points at what to optimize AND tracks whether a change improved or regressed it. A Python orchestrator drives docker compose and samples the gateway process via /proc (USS/PSS, CPU-cores), with repeats and confidence intervals - not single readings.

python -m benchmark.benchmark footprint --duration 300 --repeats 5
python -m benchmark.benchmark scaling --entities 10,30,60,100,150,250
python -m benchmark.benchmark sweep --entities 50
python -m benchmark.benchmark load --entities 30
python -m benchmark.benchmark fault --faults 1,2,4,8,16
export ROS2_MEDKIT_REF=<sha>            # pin the gateway commit to benchmark
python -m benchmark.benchmark compare --run latest --baseline benchmark/baseline/<host>.json

Each lane writes a table, a chart, and a JSON summary with a verdict line. benchmark/README.md documents the method and metrics.
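For readers unfamiliar with the two memory metrics named above, both can be read from `/proc/<pid>/smaps_rollup`. The sketch below is not the harness's sampler (that code is not shown in this conversation); the function name and the sample text are illustrative, and USS is taken as `Private_Clean + Private_Dirty`, which is the usual definition.

```python
def parse_smaps_rollup(text):
    """Return USS and PSS in KiB from the text of /proc/<pid>/smaps_rollup."""
    fields = {}
    for line in text.splitlines():
        parts = line.split()
        # Data lines look like "Pss:   1234 kB"; the address-range header does not.
        if len(parts) == 3 and parts[0].endswith(":") and parts[2] == "kB":
            fields[parts[0][:-1]] = int(parts[1])
    # USS: pages mapped only by this process (assumed definition, see lead-in).
    uss = fields.get("Private_Clean", 0) + fields.get("Private_Dirty", 0)
    return {"uss_kib": uss, "pss_kib": fields.get("Pss", 0)}


# Made-up sample for illustration, not a measurement from this PR.
SAMPLE = """\
00400000-7ffc1a5f2000 ---p 00000000 00:00 0                  [rollup]
Rss:              120000 kB
Pss:              108000 kB
Shared_Clean:      18000 kB
Shared_Dirty:          0 kB
Private_Clean:      4000 kB
Private_Dirty:     98000 kB
"""

print(parse_smaps_rollup(SAMPLE))
```

PSS additionally charges each shared page in proportion to the number of processes mapping it, so it sits between USS and RSS.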

Lanes

  • footprint / scaling / sweep - steady-state memory and CPU on the real Nav2 demo and a synthetic graph; scaling fits the growth curve with a CI.
  • heap / memcheck - heaptrack heap growth + call-sites; valgrind definitely-lost. scripts/heap_on_nav2.sh runs a long heaptrack on the real Nav2 stack (debug-symbol gateway) - the tracked heap plateaus, so the gateway does not leak on Nav2.
  • load - footprint, CPU, thread breakdown and request p50/p95 latency under M concurrent HTTP clients (holds a real SSE stream; composes onto footprint via --load).
  • fault - snapshot-capture impact as a burst (peak memory/CPU, capture duration, recovery) vs fault count, a fresh container per N.
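The CPU-cores figure these lanes report can be derived from two reads of `/proc/<pid>/stat`. This is a minimal sketch under that assumption, not the harness's own code; `cpu_ticks` and `cpu_cores` are hypothetical names, and the sample lines are made up.

```python
import os


def cpu_ticks(stat_text):
    """Return utime + stime (clock ticks) from the text of /proc/<pid>/stat."""
    # Field 2 (comm) is parenthesised and may contain spaces,
    # so split on the last ')' before counting fields.
    rest = stat_text.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]) + int(rest[12])


def cpu_cores(ticks_start, ticks_end, seconds, clk_tck=None):
    """Average number of cores kept busy between two samples."""
    if clk_tck is None:
        # Host value; when sampling inside a container, pass the container's.
        clk_tck = os.sysconf("SC_CLK_TCK")
    if seconds <= 0 or clk_tck <= 0:
        raise ValueError("seconds and clk_tck must be positive")
    return (ticks_end - ticks_start) / clk_tck / seconds


# Made-up samples taken 10 s apart.
FIRST = "1234 (gateway node) S 1 1234 1234 0 -1 4194304 500 0 0 0 250 50 0 0 20 0 12 0 100000"
SECOND = "1234 (gateway node) S 1 1234 1234 0 -1 4194304 900 0 0 0 500 100 0 0 20 0 12 0 100000"

print(cpu_cores(cpu_ticks(FIRST), cpu_ticks(SECOND), seconds=10.0, clk_tck=100))
```

Passing `clk_tck` explicitly keeps the calculation correct when the sampled process runs with a different tick rate than the host running the orchestrator.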

Regression tracking ("did we improve?")

  • Every run records the gateway SHA (ROS2_MEDKIT_REF, pinnable through Compose), demo image digest, host CPU/RAM/allocator, and a high-load flag.
  • compare diffs a run against a committed baseline/<host>.json, refuses cross-machine or high-load runs, and exits non-zero on regression (USS +10%, CPU +15%, scaling exponent CI crossing 1.0). update-baseline re-pins after a confirmed improvement.
  • A workflow_dispatch + weekly CI job (self-hosted runner) benchmarks a pinned medkit ref and fails on regression.
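The gate described above can be sketched as follows. Only the two thresholds (USS +10%, CPU +15%) and the refusal rules come from this description; the dictionary keys, the function name, the exit code used for a refused comparison, and the strict `>` at the boundary are assumptions made for illustration.

```python
USS_LIMIT = 0.10  # from the PR description: USS +10%
CPU_LIMIT = 0.15  # from the PR description: CPU +15%


def compare(run, baseline):
    """Return (exit_code, messages). All keys here are hypothetical."""
    if run["host"] != baseline["host"]:
        return 2, ["refused: run and baseline come from different hosts"]
    if run.get("high_load"):
        return 2, ["refused: run was taken under high host load"]
    messages = []
    for key, limit in (("uss_mib", USS_LIMIT), ("cpu_cores", CPU_LIMIT)):
        change = (run[key] - baseline[key]) / baseline[key]
        if change > limit:
            messages.append(f"REGRESSION {key}: {change:+.1%} (limit +{limit:.0%})")
    return (1 if messages else 0), messages


# Made-up numbers for illustration.
baseline = {"host": "bench-01", "uss_mib": 100.0, "cpu_cores": 0.20}
run = {"host": "bench-01", "high_load": False, "uss_mib": 112.0, "cpu_cores": 0.21}

code, messages = compare(run, baseline)
print(code, messages)
```

Refusing the comparison outright, rather than reporting a noisy diff, is what keeps a cross-machine or loaded-host run from being read as a regression or an improvement.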

Built to not overclaim

  • Steady-state is enforced (a still-rising run is flagged not-steady and excluded; the report shows the steady/total count).
  • Scaling verdict is CI-gated (sub/super-linear only when the CI clears 1, else INDETERMINATE; degenerate small-n fits forced to INDETERMINATE; per refresh rate, not pooled).
  • The leak slope CI is autocorrelation-corrected; a positive /proc slope without heaptrack call-sites is inconclusive, not a leak; the heap lane discloses that /proc USS under heaptrack is inflated.
  • The fault lane uses a fresh container per N (clean baseline); the load lane reports median latency across repeats.

What it found (one host, illustrative)

  • Footprint on real Nav2: gateway USS ~95-100 MiB, ~0.2-0.3 CPU-cores.
  • Scaling: USS ~ entities^0.46, CI [0.26, 0.65] - sub-linear confirmed.
  • Config: discovery refresh interval is the main CPU lever (200 ms ~4x the 1000 ms default).
  • Load: ~50 threads (~39 = executor + httplib pool), CPU 18x idle->heavy, p95 2.3 ms.
  • Fault: snapshot-capture peak grows monotonically with N (~0.5 -> ~5.8 MiB at N=16); recovers only for N<=2.
  • Heap: no leak on Nav2 (tracked heap plateaus over a 25-min heaptrack run).
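As a reading aid for the scaling result above: if USS follows a power law in the entity count, the exponent translates directly into a growth factor between two graph sizes. The helper below is illustrative only (the name is hypothetical) and takes the reported fit and CI bounds at face value.

```python
def growth_factor(n_from, n_to, exponent):
    """Multiplicative change in USS when entities go from n_from to n_to."""
    return (n_to / n_from) ** exponent


# Exponent and CI bounds as reported in the findings above.
for label, k in (("CI low", 0.26), ("fit", 0.46), ("CI high", 0.65)):
    print(f"{label}: x{growth_factor(10, 250, k):.2f}")
```

Across the 10 to 250 entity range used by the scaling command, this works out to roughly 4.4x at the fitted exponent and about 2.3x to 8.1x across the CI, against 25x for strictly linear growth.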

Notes

Synthetic lanes run the gateway and graph (or fault_manager) in one container (the Docker bridge does not forward DDS multicast) and build a debug-symbol gateway image for heap/leak work. Runs on plain Docker (probing via docker exec also covers docker-out-of-docker). Unit tests: 158. The CI job needs a fixed self-hosted runner so the host-keyed baseline stays valid.

Related Issue

n/a

Checklist

  • Tested locally
  • README updated (benchmark/README.md)

Copilot AI review requested due to automatic review settings June 18, 2026 14:23

Copilot AI left a comment

Pull request overview

Adds a new benchmark/ Python-based harness to measure the ROS2 Medkit gateway’s runtime cost (memory footprint, scaling behavior, config sweep impact, and heap/leak signals) by orchestrating Docker Compose runs and sampling /proc metrics, with accompanying report/chart generation and unit tests for the pure parsing/aggregation logic.

Changes:

  • Introduces a benchmark CLI (python -m benchmark.benchmark) with lanes: footprint, scaling, sweep, heap, memcheck, attribute, and report aggregation.
  • Adds a synthetic ROS 2 graph generator (rclpy) plus Docker Compose + Dockerfile tooling to run gateway + graph in a single container.
  • Adds a substantial pure-Python library layer for sampling/parsing/metrics/reporting, covered by unit tests and fixtures.

Reviewed changes

Copilot reviewed 48 out of 52 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| benchmark/benchmark.py | Main CLI orchestrator for all benchmark lanes; run directory management, aggregation, and reporting. |
| benchmark/turtlebot3.py | Demo wiring/config for the turtlebot3 integration benchmark target. |
| benchmark/README.md | Usage docs, prerequisites, lane descriptions, and quickstart commands. |
| benchmark/requirements.txt | Python dependencies for report/chart generation and tests. |
| benchmark/.dockerignore | Excludes results and caches from Docker build context. |
| benchmark/configs/overrides.yaml | Config override sets used for the sweep lane. |
| benchmark/lib/config_sweep.py | Pure helpers for merging and applying param overrides at the gateway namespace root. |
| benchmark/lib/docker_helpers.py | Docker/Compose wrappers for starting services, exec’ing commands, and reading /proc files. |
| benchmark/lib/gateway_client.py | JSON parsing helper for collection endpoints (items-count). |
| benchmark/lib/leak_parse.py | Pure parsers for heaptrack and valgrind memcheck summaries. |
| benchmark/lib/metrics.py | Pure numeric/stat helpers (median/IQR/linfit/slope CI/log-log exponent/steady window). |
| benchmark/lib/report.py | Repeat aggregation + lane verdict logic and markdown/chart renderers. |
| benchmark/lib/runner.py | Shared “cell runner” logic: start container, warmup, sample window, summarize. |
| benchmark/lib/runmeta.py | Captures run metadata (host, kernel, allocator, image digest, etc.). |
| benchmark/lib/sampler.py | /proc sampling and parsing (smaps_rollup, status, stat) + CSV writing. |
| benchmark/lib/warmup.py | Warmup predicates (entity-count stability + USS derivative threshold). |
| benchmark/lib/__init__.py | Package marker for benchmark.lib. |
| benchmark/scaler/spawn_nodes.py | Synthetic graph planning (node/topic/service/param specs). |
| benchmark/scaler/synthetic_graph.py | rclpy-based synthetic graph host that publishes and exposes services. |
| benchmark/scaler/__init__.py | Package marker for benchmark.scaler. |
| benchmark/profiles/synthetic.compose.yml | Docker Compose profile to run gateway + synthetic graph in one container. |
| benchmark/profiles/Dockerfile.benchmark | Benchmark image build (ROS Jazzy, tools, clone/build ros2_medkit). |
| benchmark/profiles/run_gateway_and_graph.sh | Container entrypoint to start synthetic graph and gateway (optionally under heaptrack/valgrind). |
| benchmark/profiles/fastdds.supp | Valgrind suppressions for FastDDS-related shutdown noise. |
| benchmark/tests/test_cli_wiring.py | CLI help/subcommand presence test. |
| benchmark/tests/test_config_sweep.py | Unit tests for deep-merge and override application behavior. |
| benchmark/tests/test_docker_helpers.py | Unit tests for PID parsing error cases. |
| benchmark/tests/test_gateway_client.py | Unit tests for items-count JSON parsing. |
| benchmark/tests/test_heap_report.py | Unit tests for heap report markdown rendering. |
| benchmark/tests/test_leak_parse.py | Unit tests for heaptrack/memcheck summary parsing. |
| benchmark/tests/test_memcheck_report.py | Unit tests for memcheck report markdown rendering. |
| benchmark/tests/test_metrics.py | Unit tests for numeric/stat helpers. |
| benchmark/tests/test_overrides_load.py | Unit tests for loading override sets YAML. |
| benchmark/tests/test_report_aggregate.py | Unit tests for repeat aggregation + verdict helpers. |
| benchmark/tests/test_report_render.py | Unit tests for footprint markdown rendering formatting/contents. |
| benchmark/tests/test_runner_summary.py | Unit tests for window summarization output keys and sanity. |
| benchmark/tests/test_runmeta.py | Unit tests for required run metadata fields. |
| benchmark/tests/test_sampler_loop.py | Unit tests for sampling loop helpers and CPU-cores derivation. |
| benchmark/tests/test_sampler_parse.py | Unit tests for /proc parsing routines with fixtures. |
| benchmark/tests/test_scaler_plan.py | Unit tests for synthetic graph planning (counts, uniqueness, cardinality). |
| benchmark/tests/test_scaling_rows.py | Unit tests for scaling row derivation (USS per entity). |
| benchmark/tests/test_validation.py | Unit tests for synthetic-vs-demo validation messaging. |
| benchmark/tests/test_warmup.py | Unit tests for warmup predicate helpers. |
| benchmark/tests/fixtures/smaps_rollup.txt | Fixture for smaps_rollup parsing tests. |
| benchmark/tests/fixtures/stat.txt | Fixture for /proc/<pid>/stat parsing tests. |
| benchmark/tests/fixtures/status.txt | Fixture for /proc/<pid>/status parsing tests. |
| benchmark/tests/fixtures/memcheck.txt | Fixture for valgrind memcheck parsing tests. |
| benchmark/tests/fixtures/heaptrack_print.txt | Fixture for heaptrack_print parsing tests. |
| benchmark/tests/fixtures/medkit_params.yaml | Fixture for params override application tests. |
| benchmark/tests/__init__.py | Package marker for benchmark.tests. |
| benchmark/__init__.py | Package marker for benchmark. |
| .gitignore | Ignores benchmark results output, baked params, and benchmark pyc/caches. |

Comment thread benchmark/tests/test_cli_wiring.py Outdated
Comment thread benchmark/lib/docker_helpers.py
Comment thread benchmark/benchmark.py
Comment thread benchmark/lib/report.py Outdated
Comment thread benchmark/benchmark.py
Comment thread benchmark/turtlebot3.py Outdated
@bburda bburda force-pushed the feat/benchmark-harness branch 2 times, most recently from 6242b91 to b9997a7 June 18, 2026 14:53
@bburda bburda marked this pull request as draft June 18, 2026 15:00
@bburda bburda force-pushed the feat/benchmark-harness branch 6 times, most recently from cea132c to 4d43cc1 June 19, 2026 17:14
@bburda bburda self-assigned this Jun 19, 2026
@bburda bburda requested review from Copilot and mfaferek93 June 19, 2026 17:27

Copilot AI left a comment

Pull request overview

Copilot reviewed 66 out of 71 changed files in this pull request and generated 6 comments.

Comment thread benchmark/tests/test_cli_wiring.py
Comment thread benchmark/benchmark.py
Comment thread benchmark/lib/fault_injector.py
Comment thread benchmark/lib/fault_injector.py
Comment thread benchmark/turtlebot3.py Outdated
Comment thread .github/workflows/benchmark.yml
bburda added 6 commits June 19, 2026 19:33

/proc USS/PSS/CPU sampling, the Student-t + AR(1) statistics engine, run metadata,
config-override loading and leak/memcheck log parsing, with unit tests.

…compare
Fresh-container cell runner with the enforced warm-up gate, median/IQR aggregation and
CI-gated report rendering, the transient burst sampler, and the baseline-diff engine.

…r profiles
Synthetic ROS 2 graph + HTTP load generator, the fault_manager injector, and the
single-container gateway/graph/fault images and entrypoints.

The orchestrator CLI wiring every lane plus all/report, and the harness README
documenting the method, metrics and lanes.

Rebuilds the gateway with debug symbols and runs it under heaptrack attached to the
real Nav2 graph; the tracked heap plateaus, so the gateway does not leak on Nav2.

Pin the gateway commit via ROS2_MEDKIT_REF through Compose, capture the SHA in the demo
image, seed a host-keyed baseline, and add a dispatch+weekly CI job that compares a run
against it and fails on regression.
@bburda bburda force-pushed the feat/benchmark-harness branch from 4d43cc1 to e8075e4 June 19, 2026 17:34
Comment thread benchmark/benchmark.py Outdated
Comment thread benchmark/benchmark.py Outdated
Comment thread benchmark/lib/compare.py Outdated
Comment thread benchmark/benchmark.py Outdated
Comment thread .github/workflows/benchmark.yml Outdated
Comment thread benchmark/lib/burst.py Outdated
Comment thread benchmark/scaler/load_gen.py Outdated
Comment thread benchmark/lib/fault_injector.py Outdated
Comment thread benchmark/scripts/heap_on_nav2.sh Outdated
Comment thread benchmark/tests/test_cli_wiring.py Outdated
@bburda bburda marked this pull request as ready for review June 20, 2026 08:05
Add a churn lane that gates gateway memory growth under ROS graph churn
(static vs churning-graph USS slope, PASS/FAIL, exit 1 on leak), plus a
synthetic-graph churn mode (BENCH_CHURN_SEC / BENCH_CHURN_COUNT).

Honesty and robustness fixes so lanes report real data instead of silent zeros:
- memcheck: run valgrind directly on gateway_node (not the ros2 launcher),
  capture stderr, gate on readiness, poll for the LEAK SUMMARY; fix the
  malformed fastdds.supp that made valgrind abort at startup
- heap: bash pipefail and fail loud when heaptrack produces no summary
- heap_on_nav2.sh: two-phase (clean USS without heaptrack as the leak verdict,
  short heaptrack pass for call-site attribution) with an OLS slope CI
- scaling regression gate: baseline-relative CI comparison instead of an
  absolute ci_lo>1 threshold; absolute floor only when baseline is absent
- compare: gate high host load across all lanes, not just the first
- docker_helpers: per-call subprocess timeouts and curl --max-time /
  --connect-timeout; accept any positive CLK_TCK; merge_stderr option
- sampler: tolerate transient /proc read errors, stop after the process is gone
- burst: take clk_tck as a parameter; require USS to leave the band before
  declaring recovery
- load_gen: include timed-out requests in tail latency, report error_rate,
  stop on SIGTERM
- fault lane: mark failed cells and exclude them from the table, chart and
  optimization signals instead of rendering fabricated zeros
- runner: warm the gateway under load for the load lane; treat an empty sample
  window as a failed cell
- report: leak verdict wording (leaked-at-exit, not "heap grew")
- cmd_report / _latest_run_dir: clear error on missing or empty results dir
- cmd_load: per-level thread census
- turtlebot3: override_root typed as list[str]

Tooling:
- --run-dir to write several lanes into one shared run dir for a single compare
- CI runs the harness unit tests on a GitHub-hosted runner
- portable test working directory; docs and unit tests for all of the above
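The load_gen fix listed above (counting timed-out requests in tail latency and reporting an error rate) can be illustrated as follows. This is a sketch under assumptions, not the harness's code: timed-out requests are represented as `None` and charged the full timeout, only timeouts count toward `error_rate`, percentiles use the nearest-rank method, and all names are hypothetical.

```python
def percentile(sorted_values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    rank = max(1, (q * len(sorted_values) + 99) // 100)  # integer ceil
    return sorted_values[rank - 1]


def summarize(latencies_ms, timeout_ms):
    """latencies_ms holds one entry per request; None marks a timeout."""
    if not latencies_ms:
        raise ValueError("no requests recorded")
    # Charge each timed-out request the full timeout instead of dropping it.
    filled = sorted(timeout_ms if v is None else v for v in latencies_ms)
    n_timeouts = sum(v is None for v in latencies_ms)
    return {
        "p50_ms": percentile(filled, 50),
        "p95_ms": percentile(filled, 95),
        "error_rate": n_timeouts / len(latencies_ms),
    }


# Made-up run: 18 fast requests and 2 timeouts at a 1000 ms limit.
samples = [1.0] * 18 + [None, None]
print(summarize(samples, timeout_ms=1000.0))
```

If the two timeouts were dropped instead, p95 would come out at 1.0 ms and the run would look healthy, which is the kind of silent understatement the commit message says it removes.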

@mfaferek93 mfaferek93 left a comment

LGTM!

@bburda bburda merged commit 471ca2c into main Jun 21, 2026
5 checks passed
@bburda bburda deleted the feat/benchmark-harness branch June 21, 2026 09:21
3 participants